ABSTRACT


The UK Catalysis Hub (UKCH) is designing a virtual research environment to support data processing and 
analysis, the Catalysis Research Workbench (CRW). The development of this platform requires identifying the 
processing and analysis needs of the UKCH members and mapping them to potential solutions. This paper 
presents a proposal for a demonstrator to analyse the use of scientific workflows for large scale data processing. 
The demonstrator provides a concrete target to promote further discussion of the processing and analysis 
needs of the UKCH community. We discuss the main data processing requirements elicited, the
adaptations that will be incorporated into the design of the CRW, and how the proposed solutions can be
integrated with the existing practices of the UKCH. The demonstrator has been used in
discussion with researchers and in presentations to the UKCH community, generating increased interest and 
motivating further development. 


† Corresponding author: Abraham Nieva de la Hidalga (Email: nievadelahidalgaa@cardiff.ac.uk; ORCID: 0000-0001-7348-7612).
1. INTRODUCTION


Experimental and computational simulation techniques developed to understand the nature of materials 
and their practical applications in catalysis research rely on the use of data for building and validating 
complex models (such as the example in Figure 1). The UK Catalysis Hub (UKCH) enables cutting-edge
research in catalytic science by facilitating access to state-of-the-art resources and expertise. The UKCH
provides access to well-equipped laboratories and to central facilities provided by the Science and Technology
Facilities Council (STFC), and offers expert advice on the processing and analysis of the data produced from
experiments and theoretical models.


Figure 1. Diagram of an in-situ XAFS analysis experiment [11]. The targets of this proposal are the processes and
outputs after XAFS microspectroscopy (performed at Diamond Beamline B22: Multimode InfraRed Imaging And
Microspectroscopy (MIRIAM)), on the lower rightmost branch of the experimental process, which uses ATHENA
and ARTEMIS for processing and analysis of XAFS data.


UKCH researchers use advanced processing and analysis software such as Mantid [3], DAWN [4],
Larch [29], and Demeter [33] to handle the data produced by their research projects. These tools allow
scientists to process and analyze data interactively. Additionally, each scientist has a choice of analysis
software, such as MATLAB, R, and Excel, to further analyze data and to format results for publishing. STFC
facilities (CLF [7, 8], Diamond [13, 20], and ISIS [15, 21]) operate 24 hours a day and have the capacity
to perform thousands of readings, which produce large datasets that require further processing and
analysis. Naturally, the time spent processing and analyzing data increases with the size of the
datasets. Moreover, new experiment proposals aiming to collect even larger quantities of data push the
boundaries of the processing capacity of analysis tools [37].


Bearing in mind the current and future requirements for processing and analysis of increasing data
volumes, the UKCH started designing a virtual research environment, the Catalysis Research Workbench 
(CRW). The development of this platform requires identifying the processing and analysis needs of the 
UKCH members and mapping them to potential solutions. In this requirements collection phase, the UKCH 
implemented a workflow demonstrator to foster further discussion and analysis of the requirements for the 
CRW. The goal of the demonstrator is to introduce the concept of managed scientific workflows and discuss 
their integration in the day-to-day practices of UKCH researchers. The demonstrator has been used in 
discussion with researchers and in presentations to the UKCH community, generating increased interest 
and motivating further development. 


2. RELATED WORK 


The use of software prototypes is an established software engineering practice [6, 22, 28, 34, 35, 36]. 
A demonstrator is a type of functional prototype which is used in proof-of-concept studies to support the 
illustration of complex design proposals to a wide range of system stakeholders. The demonstrator can be 
presented by the designer who describes the details of the implementation while performing a specific set 
of tasks, often scripted, and then requests feedback from the user community. 


There are various cases in which prototypes (and demonstrators) have been used successfully to present 
implementation proposals and to refine and prioritize user requirements. Prototyping has been used for 
multiple purposes such as the description of architectural decisions, discussion of interface design, and
presentation of new functionalities. Davis et al. use a Web service-based e-science demonstrator to explain 
the architectural design for a text mining platform [10]. Klampanos et al. describe the implementation of 
an information registry prototype to demonstrate how it can enable collaboration and ensure consistency 
across the distributed infrastructure for Dispel and dispel4py [23]. Leong et al. present the implementation 
of three use cases to demonstrate the feasibility and benefits of applying a cloud driven approach to 
supercomputing ecosystems [25] for large scale experimental facilities. 


In the workflow domain, Goble et al. used a demonstrator to present the design principles and functionality 
of the myGrid middleware suite, to facilitate the work of bioinformaticians [17]. Nieva et al. describe the 
use of different prototypes in the review of alternative designs for a web interface for the Taverna workflow 
management system [28]. Watkins et al. present Workspace, a scientific workflow system that includes 
rapid prototyping features enabling the testing of different components and configurations during the design 
of complex workflows [38]. 
3. PROBLEM FORMULATION


Large scale research facilities such as the Central Laser Facility (CLF [7, 8]), Diamond Light Source
(Diamond [13, 20]), and ISIS Muon and Neutron Source (ISIS [15, 21]) have an operational framework supported 
by their Data Management Policies. This framework governs their Laboratory Information Management 
(LIM) systems and the Data Management System (DMS). The main commonality of these facilities is 
that they use ICAT, an advanced catalogue system that combines LIM and DMS functionalities [16]. ICAT 
is developed by the Scientific Computing Department of the Science and Technology Facilities Council 
(SCD-STFC) and other institutions. The ICAT system contains complementary data for each experiment, such as the
proposal, PI, experimenter, grant(s), device(s), experiment metadata, and experiment results. As a result, the
extended workflows of CLF, Diamond, and ISIS can be generalized as shown in Figure 2.




Figure 2. The generic processing workflow of CLF, Diamond, and ISIS (Adapted from [27]). The data reduction 
and analyses tasks highlighted in red are traditionally performed by facilities users, decoupled from facilities. 


As Figure 2 indicates, Data Reduction and Data Analyses tasks are entrusted to the facilities' users, i.e.,
scientists who have been awarded experimental time at the facilities. These tasks are the ones which require
further support, as researchers report that processing and analysing data after the experiment requires
a substantial amount of time and processing resources. The research facilities provide software for collecting
and formatting the data generated (for instance Mantid [3] and DAWN [4]); however, the researchers still
need to handle the data and combine it with other data according to their objectives. Researchers rely on
a combination of data and software resources (own and shared) in their daily work. In this context, there
are several issues that the researcher needs to handle, such as mastering the use of several types of analysis
tools, including lab equipment, processing software, and databases; converting data so that it can be used
at different stages; and ensuring the reproducibility of the results by tracking the equipment and software used,
entry parameters, intermediate results, and versions of completed runs.
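

The tracking items listed above (software used, entry parameters, intermediate results, and versions of completed runs) could be captured in a small run record. The following is a minimal sketch in Python, not part of the demonstrator: the class name, the field names, the file names, and the parameter name are all hypothetical and chosen only for illustration.

```python
import hashlib
import json
import platform
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Return a hex digest that identifies one exact version of a file's contents."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class RunRecord:
    """One completed run of a processing task (all field names are illustrative)."""
    task: str            # e.g. "Normalise data"
    software: dict       # tool name -> version string
    parameters: dict     # entry parameters used for the run
    input_hashes: dict   # input file name -> SHA-256 digest
    output_hashes: dict = field(default_factory=dict)  # intermediate or final results
    started: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    python_version: str = field(default_factory=platform.python_version)

    def to_json(self) -> str:
        """Serialise the record so it can be stored next to the results."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Hypothetical file names and parameter; the byte strings stand in for file contents.
record = RunRecord(
    task="Normalise data",
    software={"Larch": "0.9.47"},
    parameters={"example_parameter": 1.0},
    input_hashes={"scan_0001.dat": sha256_of(b"placeholder file contents")},
)
record.output_hashes["scan_0001.prj"] = sha256_of(b"placeholder result")
print(record.to_json())
```

Storing content hashes rather than file names alone is one possible design choice: it lets two runs be compared even when files have been renamed or moved between stages.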


4. THE PROPOSED APPROACH 


The need to support users in the processing and analysis of research data has gained higher priority
because the size and complexity of datasets are constantly increasing. This is the case for X-ray Absorption
Spectroscopy (XAS) analysis, with the development of higher-throughput analysis devices [37] and longer
running times. Up to now, researchers have managed by using interactive software for formatting, processing,
reducing, and summarizing experimental data. However, researchers are spending longer hours processing and
formatting data, which distracts them from their experimental work. These time-consuming activities are the
target of the proposed workflows. We aim to build on the experience gathered in the adoption of workflow
technologies and propose the creation of concrete examples which demonstrate the advantages of using
scientific workflow management tools when compared to current processing practices.


4.1 Example for the Workflow Demonstrator 


The explicit definition of processing workflows provides a complete view of the activities performed, the 
software used, and the data consumed and produced. After defining the workflow, its individual tasks can 
then be implemented modularly, allowing the combination and swapping of components. The processing of
X-ray Absorption Spectroscopy (XAS) data is relevant because of the number of experiments performed and
the quantities of data produced. Normally, a scientist uses Artemis and Athena [33] in a well-defined,
structured fashion. Moreover, the XAS processing workflow is well documented and there are several 
examples and tutorials on the use of Artemis and Athena for performing the workflow tasks [31, 33]. Athena 
and Artemis tasks can be scripted in Perl using Demeter. Additionally, there are alternative tools which have 
been proposed and can also be automated through scripting (e.g., Larch [29]). All these considerations 
made the XAS processing workflow the selected target to implement as an example for the demonstrator. 


4.2 The XAS Processing Workflow 


The XAS processing workflow consists of three tasks: Process Raw Data, Normalise Data, and Analyse 
Data. This division of the tasks is derived from Ravel’s online courses [31, 32], the DAWN tutorials [14],
and from discussion with coauthors about processing practices. Figure 3A presents an overview of the three 
tasks of the XAS processing workflow. At this level, we can name the software, inputs, and outputs for each 
task of the workflow. The analysis of the workflow can be further refined to identify the sub-tasks within 
each task, providing a modular view of the workflow components. Figure 3B shows a finer grained 
description of the sub-tasks of the workflow. At this level, tasks are better defined as modules which can
be implemented independently. This representation of the sub-tasks, including their relationships,
precedence, inputs, resources, and outputs, is the starting point for the implementation of the
workflow.
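

The modular view described above can be sketched in Python by recording, for each task, the software used and the file types consumed and produced. This is an illustrative sketch only: the task names and tools follow the three top-level tasks named in this section, while the `Task` class, the `unmet_inputs` helper, and the use of file extensions as type tags (including `dat` as the output of the first task and the `artemis` tag) are assumptions made for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Task:
    """One workflow task: the tool that runs it and the file types it reads and writes."""
    name: str
    software: str
    consumes: tuple
    produces: tuple


# Top-level tasks of the XAS processing workflow (file-type tags are assumptions).
XAS_WORKFLOW = [
    Task("Process raw data", "DAWN", ("nxs",), ("dat",)),
    Task("Normalise data", "Athena", ("dat",), ("prj",)),
    Task("Analyse data", "Artemis", ("prj", "cif"), ("artemis",)),
]


def unmet_inputs(workflow, external_inputs):
    """Return (task name, file type) pairs that no earlier task or external source provides."""
    available = set(external_inputs)
    unmet = []
    for task in workflow:
        for file_type in task.consumes:
            if file_type not in available:
                unmet.append((task.name, file_type))
        available.update(task.produces)
    return unmet


print(unmet_inputs(XAS_WORKFLOW, {"nxs", "cif"}))  # prints []
print(unmet_inputs(XAS_WORKFLOW, {"nxs"}))         # prints [('Analyse data', 'cif')]

# Swapping the tool behind a task leaves the task interfaces, and the check, unchanged.
larch_variant = [XAS_WORKFLOW[0]] + [replace(t, software="Larch") for t in XAS_WORKFLOW[1:]]
print(unmet_inputs(larch_variant, {"nxs", "cif"}))
```

Describing tasks by their interfaces rather than by their tools is what makes the swapping of components mentioned in Section 4.1 straightforward to check.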


4.3 Implementing the Workflow 


After identifying the core tasks to implement, their sequence, and the expected inputs and outputs of
each sub-task, it was possible to decide on alternative ways to implement the workflow and to define which
metrics to use to analyse the performance of the different instances. Three types of workflow configurations
have been implemented: Manual (Interactive/User driven), Scripted (automated with scripts), and Managed
(Semi-automated/User supervised). The manual workflow is a reproduction of the textbook example,
using the sample data from literature [31, 32], repeated using an example from our experimental
colleagues. The manual version is also used to calculate the baseline time for execution of one full cycle
of the workflow, from raw data to fitted data results. The two versions of the scripted workflow were
implemented using Demeter and Larch (one for each). The Demeter version is scripted in Perl and allows
running the same process as the manual workflow. The main difference is that the interface is text based
and the operations are presented in a text menu. The Larch version of the scripted workflow is implemented
using Larch and Jupyter Notebooks (Python). Finally, two managed versions of the workflow were designed
to be executed using Nextflow [12] in combination with Larch and Demeter.


Figure 3. Overview (A) and detailed view (B) of the XAS processing workflow.


The first three versions of the workflow were fully implemented and used in demonstrations while the 
Nextflow managed version is in the process of being implemented for execution on a high-performance 
computing environment. 
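

As a rough illustration of what separates a scripted version from the manual one, the sketch below applies the same ordered steps to every reading and keeps a log entry per reading. It is not the Demeter or Larch script used in the demonstrator: `run_batch`, the two placeholder steps, and the reading names are hypothetical, and the steps do no real XAS processing.

```python
import time
from typing import Callable, Dict, Iterable, List


def run_batch(readings: Iterable[str],
              steps: List[Callable[[dict], dict]]) -> List[Dict]:
    """Apply every step, in order, to each reading and keep one log entry per reading."""
    log = []
    for name in readings:
        state = {"reading": name}
        started = time.perf_counter()
        try:
            for step in steps:
                state = step(state)
            status = "ok"
        except Exception as exc:
            # Record the failure and continue, so one bad reading does not stop the batch.
            status = f"failed: {exc}"
        log.append({
            "reading": name,
            "status": status,
            "seconds": time.perf_counter() - started,
            "result": state,
        })
    return log


# Placeholder steps: a real script would call Demeter or Larch here.
def normalise(state: dict) -> dict:
    return {**state, "normalised": True}


def analyse(state: dict) -> dict:
    if state["reading"].endswith("_bad"):
        raise ValueError("no usable signal")
    return {**state, "fitted": True}


log = run_batch(["scan_0001", "scan_0002_bad", "scan_0003"], [normalise, analyse])
for entry in log:
    print(entry["reading"], entry["status"])
```

The per-reading log plays the role of the log files produced by the scripted versions, and the recorded times are the kind of measurement that a comparison between implementations needs.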
5. EXPERIMENTS AND ANALYSIS


Each of the three alternative implementations of the workflow was designed and tested using sample 
data from the textbook examples. The comparison of processing times was then made using two datasets
containing data from actual UKCH experiments. The running times were averaged and used to
compare the proposals, and the workflows were demonstrated to scientists to get feedback on the implementation.


5.1 Datasets and Experimental Setup 


As described previously, three datasets were used. The first dataset, used for development and testing,
is the dataset of Ravel’s textbook example [31, 32]. The textbook dataset consists of one file containing
the crystallographic data for iron sulfide (pyrite, FeS2) and a transmission scan of FeS2 taken at room
temperature at beamline 13BM at the Advanced Photon Source [32]. The other two example datasets consist
of a nexus file containing Rh4CO spectral data gathered from the I20 Energy Dispersive EXAFS (EDE)
beamline at Diamond Light Source and a crystallographic data file of tetrarhodium dodecacarbonyl obtained
from the Crystallography Open Database [9, 18].


The software used for implementing the workflows included DAWN V.2.16.1, Demeter V. 0.9.26 (which 
includes Artemis and Athena), Larch V. 0.9.47, Perl V. 5.12.3, Python V. 3.6.10, and Jupyter V. 6.0.3. The 
system used for running the experiments was a laptop computer with Windows 10 64-bit operating system, 
Intel Core i5-8250 1.60 GHz Processor, and 8 GB Memory. 


5.2 Overall Comparison of Results 


The three implemented examples of the workflow were individually timed to compare their potential
for speeding up the processing and analysis of XAS data. The manual version of the workflow based on the
textbook example takes about 24 minutes to produce one complete run from raw data to fitted data results.
This average time was taken from performing the workflow activities manually with ten samples of the
Rh4CO spectra and then averaging the processing time from start to finish. Using these data, we calculate
that processing a dataset of 3,790 readings would take about 63 days. The experts in the group consider
that they can perform one complete run in 10 minutes, which would require approximately 26 days to
process the 3,790 readings dataset.
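

The manual estimates follow from multiplying the time per fit by the number of readings. The short Python check below reproduces that arithmetic under the assumption of continuous, round-the-clock work; the function name is illustrative. The ten-minute case evaluates to roughly 26 days, the figure listed in Table 1.

```python
def days_to_process(readings: int, minutes_per_fit: float) -> float:
    """Elapsed days if fits are produced one after another without breaks."""
    return readings * minutes_per_fit / 60 / 24


READINGS = 3790
print(f"novice: {days_to_process(READINGS, 24):.1f} days")  # prints novice: 63.2 days
print(f"expert: {days_to_process(READINGS, 10):.1f} days")  # prints expert: 26.3 days
```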


The first scripted version of the workflow uses Demeter and Perl, and it processes a dataset of 3,790 groups
in about 22 hours (~1 day). This is a considerable improvement over the manual
workflow. The second scripted version of the workflow uses Larch, Jupyter, and Python. It is slower than the
Demeter version, but can still reduce the processing time to 103 hours (4.3 days), taking only 20% of the
time required for manually processing a full dataset.
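

The scripted totals appear consistent with a linear extrapolation of the timings reported in Table 1. The sketch below shows one way to reproduce them in Python, assuming a constant time per fit; this reading of the table and the function names are assumptions, not the authors' stated procedure.

```python
READINGS = 3790


def hours_from_seconds_per_fit(seconds_per_fit: float, readings: int = READINGS) -> float:
    """Total hours when every fit takes the same number of seconds."""
    return seconds_per_fit * readings / 3600


def hours_from_batch(batch_minutes: float, batch_size: int, readings: int = READINGS) -> float:
    """Scale a timed batch linearly to the full dataset."""
    return batch_minutes / batch_size * readings / 60


demeter_hours = hours_from_seconds_per_fit(21)  # Table 1: about 21 s to produce 1 fit
larch_hours = hours_from_batch(814, 500)        # Table 1: 814 min to analyse 500 files
print(f"Demeter: {demeter_hours:.1f} h")                              # prints Demeter: 22.1 h
print(f"Larch:   {larch_hours:.1f} h ({larch_hours / 24:.1f} days)")  # prints Larch:   102.8 h (4.3 days)
```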


The results in Table 1 were obtained using a laptop computer with limited memory and processing power.
In comparison, the initial results of a Nextflow-Larch version of the workflow reduced processing time to
7 hours and 21 minutes for the largest dataset (4000 groups) when executed on the ARCCA-HPC cluster.
At this stage, the positive reviews and suggestions for improvement received when presenting the
demonstrator to stakeholders indicate that the approach could be applied to real-life scenarios.


Table 1. Comparison of workflow instances in terms of speed.

Manual workflow
Task              Software  Time     Input                        Output
Process raw data  DAWN      8 min    1 nexus [.nxs] file          3580-4000 files
Normalise data    Athena    3 min    1 data [.dat] file           1 Athena file
Analyse data      Artemis   21 min   1 Athena [.prj] file         1 Artemis file
                                     1 Crystal [.inp/.cif] file
Novice user processing 1 dataset: ~63 days (~24 min to produce 1 fit)
Expert processing 1 dataset: ~26 days (~10 min to produce 1 fit)

Scripted workflow (Demeter)
Task              Software  Time     Input                        Output
Process raw data  DAWN      8 min    1 nexus [.nxs] file          3580-4000 files
Normalise data    Demeter   64 min   500 data [.dat] files        500 Athena files
Analyse data      Demeter   21 min   500 Athena [.prj] file       500 Demeter [.dpj] files
                                     1 Crystal [.inp/.cif] file   500 Fit [.fit] files
                                                                  500 Log files
Processing a dataset with 3,790 groups: ~22 hours (~21 sec to produce 1 fit)

Scripted workflow (Larch)
Task              Software  Time     Input                        Output
Process raw data  DAWN      8 min    1 nexus [.nxs] file          3580-4000 files
Normalise data    Larch     8 min    4000 data [.dat] files       4000 Athena files
Analyse data      Larch     814 min  500 Athena [.prj] file       500 Demeter [.dpj] files
                                     1 Crystal [.inp/.cif] file   500 Fit [.fit] files
                                                                  500 Log files
Processing a dataset with 3,790 groups: ~103 hours (~1.5 min to produce 1 fit)


5.3 Results Analysis 


The three versions of the workflow have been showcased and discussed with researchers on two separate
occasions, providing valuable feedback, suggestions for improvement, and ideas for future developments. The
workflows are not intended to be fully operational processing and analysis tools; instead, the functionalities
and details of the examples are intended to illustrate the benefits of adopting a workflow-oriented design
approach. The workflows were first demonstrated at a workshop with our coauthors, which served to
demonstrate the feasibility of automating repetitive tasks and provided some recommendations for
improving the workflows. The second presentation of the demonstrator, during one of the monthly
UKCH seminars, exposed the workflows to a larger community and prompted suggestions and queries
about implementing other analyses using workflows.


At this stage we can highlight the advantages and disadvantages of each of the workflow implementations, 
including the ones which are still under development. The scripted and managed versions of the workflows 
are faster for the processing and analysis of data. Moreover, expert users recommended improvements such 
as monitoring output values to determine if the executions should be terminated early. 
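

The early-termination suggestion could be sketched as follows in Python. The example assumes that each fit yields a single monitored value for which larger means worse (for instance a fit-quality measure), and it stops once several consecutive values exceed a threshold. The function name, the threshold, the `patience` parameter, and the sample values are hypothetical and are not taken from the demonstrator.

```python
from typing import Callable, Iterable, List, Tuple


def run_with_early_stop(readings: Iterable[str],
                        fit: Callable[[str], float],
                        threshold: float,
                        patience: int = 3) -> Tuple[List[Tuple[str, float]], bool]:
    """Fit readings in order; stop once `patience` consecutive values exceed `threshold`.

    Returns the (reading, value) pairs processed and whether the run was cut short.
    """
    results = []
    consecutive_bad = 0
    for name in readings:
        value = fit(name)
        results.append((name, value))
        # Reset the counter on any acceptable value so isolated outliers do not stop the run.
        consecutive_bad = consecutive_bad + 1 if value > threshold else 0
        if consecutive_bad >= patience:
            return results, True
    return results, False


# Hypothetical monitored values standing in for real fit output.
fake_values = {"r1": 0.01, "r2": 0.02, "r3": 0.30, "r4": 0.35, "r5": 0.40, "r6": 0.01}
processed, stopped = run_with_early_stop(fake_values, fake_values.get, threshold=0.05)
print(len(processed), stopped)  # prints 5 True
```

Requiring several consecutive poor values before stopping is one possible policy; a stricter or looser rule may suit runs that are expected to be noisy.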


Table 2. Comparison of workflow instances.

Workflow     Software                          Type      Issues                              Advantages
Manual       Artemis, Athena                   Manual    Slow processing                     Interactive visual interface
                                                         Limit on number of datasets loaded  Fine tuning control
                                                         Individually processing datasets
Demeter      Demeter, Perl                     Scripted  Limited by processing resources     Processing large quantities of data
                                                         Text interface
Larch        Larch, Python, Jupyter Notebook   Scripted  Limited by processing resources     Interactive visual interface
                                                         Requires Demeter for one key task   Fine tuning control
                                                                                             Processing large quantities of data
Nextflow 01  Demeter, Perl, Nextflow           Managed   Limited by processing resources     Processing large quantities of data
                                                         Text interface                      Unsupervised execution
Nextflow 02  Larch, Python, Nextflow           Managed   Limited by processing resources     Processing large quantities of data
                                                         Text interface                      Unsupervised execution
                                                         Requires Demeter for one key task


5.4 Relevance of the Canonical Workflow Framework for Research


The Canonical Workflow Framework for Research (CWFR) is aimed at facilitating the interoperation of
data management workflows across institutional boundaries [19]. To achieve this, the CWFR aims
to explicitly document the repetitive tasks which are common across diverse institutional data management
workflows (Figure 4). For the management and exploitation of catalysis research data, the CWFR model
allows stepping back and looking at possibilities for integrating the facilities' workflows with the workflows
of other institutions accessing the facilities. The diagram in Figure 5 shows how the STFC workflow is
aligned with the workflows of other institutions, and how the tasks of these workflows can be mapped to
the CWFR tasks.


Figure 4. Common tasks identified in the first version of the CWFR (Adapted from [19]). The tasks are not all
carried out by one institution; they are complementary and can be fulfilled by different institutions collaborating in the
research effort. This is shown in the example provided in Figure 5.


Data Intelligence 463 


202211.00428v1 


chinaXiv 




A Workflow Demonstrator for Processing Catalysis Research Data 


Figure 5. Parallels between the experimental workflows of STFC facilities and other institutions mapped to CWFR
tasks. The figure shows two workflows and the activities performed in parallel during research collaborations. The
upper workflow is the same as that presented in Figure 2, while the lower one stands for the workflow performed by
institutions accessing and collaborating with STFC facilities (research institutions, universities, industry, and
publishers). The numbers in orange represent the tasks identified in the CWFR (see Figure 4). The five tasks
highlighted in red correspond to activities supported by the type of workflows described in this paper.


The extended workflow for STFC facilities provides the basic scaffolding for integrating with other workflows.
The main tool underpinning this workflow is the ICAT system [16]. ICAT combines the functionalities of a
Laboratory Information Management (LIM) system and a Data Management System (DMS). ICAT registers
complementary data for each experiment, such as the proposal, PI, experimenter(s), grant(s), device(s), experiment
metadata, and experiment results. ICAT is common to many facilities (in the UK and overseas). It supports the
required management functionalities by implementing the Core Scientific MetaData model (CSMD) [26]. This
model captures metadata about the experiments and datasets produced at the facilities managed by STFC.
By design, the operational workflows of the facilities rely on the CSMD for managing experiments from
the proposal stage to the collection and distribution of experimental data.


Additionally, the CSMD can enable the alignment of workflows of other institutions by mapping to other 
ontologies, such as PROV-O for tracking provenance [24], DCAT for the description of data objects [2], 
and SPAR [30] and SCHOLIX [5] for linking data objects and publications. 


6. CONCLUSION AND FUTURE WORK 


The implementation of the demonstrator with three versions of the XAS workflow is a first attempt to
promote greater usage of the scientific workflow approach at the UKCH. This first version of the demonstrator has
stimulated interest in further research on workflow management platforms. We plan to continue the
development by completing the Nextflow version [12] and possibly adding examples for Galaxy [1] and
Taverna [39], which provide different benefits. These will then be evaluated in further demonstrations to
gather more requirements for implementation.


Looking forward, the UKCH will try to standardize the procedures for describing and implementing other
processing workflows to support data processing and analysis. For this, we are considering new examples,
such as Quasi-Elastic Neutron Scattering (QENS) and X-Ray Powder Diffraction (XRD) processing workflows.
In the longer term, the evaluation of workflow implementation alternatives will help the UKCH in better 
defining the requirements and design constraints to be followed for the development of the Catalysis 
Research Workbench (CRW). 